Abstract

Previous studies have indicated that the reliability of test scores composed of testlets is 
overestimated by conventional item-based reliability estimation methods (Sireci, Thissen & Wainer,
1991; Wainer, 1995; Wainer & Thissen, 1996; Lee & Frisbie, in press). In light of these previous studies,
it seems reasonable to ask whether the item-based estimation methods for the conditional standard error 
of measurement (SEM) would provide underestimates for tests composed of testlets. The primary 
purpose of this study was to investigate the appropriateness and implication of incorporating a testlet 
definition into the estimation procedures of the conditional SEM for tests composed of testlets. Another 
purpose was to investigate the bias in estimates of the conditional SEM when using item-based methods 
instead of testlet-based methods. Several estimation procedures were proposed and compared in 
estimating conditional SEM for tests composed of testlets. 

Conditional Standard Errors of Measurement
for Tests Composed of Testlets

Testlets, as the name implies, have been defined as small tests (Wainer & Kiely, 1987; Wainer & 
Lewis, 1990). Previous studies have indicated that the reliability of test scores composed of testlets is 
overestimated by conventional item-based reliability estimation methods (Sireci, Thissen & Wainer,
1991; Wainer, 1995; Wainer & Thissen, 1996; Lee & Frisbie, in press). That is, when subgroups of items
in a test are related to the same passage or other stimulus material, there might be statistical 
dependence among those items, causing an item-based reliability estimate to be inflated relative to an 
estimate of reliability based on the correlation between equivalent forms (Lawrence, 1995). In light of 
these previous studies, it seems reasonable to ask whether the item-based estimation methods for the 
conditional standard error of measurement (conditional SEM) would provide underestimates for tests 
composed of testlets. This question was the main motivation for doing this study. 

When measurement models are applied in practical situations, some statistical assumptions 
must be made, such as conditional independence (or uncorrelated errors) and unidimensionality.
Because the unidimensional measurement models based on dichotomously scored items are frequently
used for practical applications, it is important to study the robustness of these models to violation of 
their assumptions in various applied contexts. Previous studies have shown that the assumptions for 
measurement are frequently violated by tests composed of testlets (Sireci, Thissen & Wainer, 1991; 
Wainer, 1995; Wainer & Thissen, 1996; Lee & Frisbie, in press; Lee, Kolen, Frisbie & Ankenmann, 1998). 
Therefore, applying unidimensional measurement models based on dichotomously scored items to 
estimating conditional SEM for tests composed of testlets might be inappropriate. Because there is little 
evidence in the literature about how the violation of assumptions affects estimates of conditional SEM, it 
is not clear how serious the degree of distortion of the conditional SEM estimates might be. 

The primary purpose of this study was to investigate the appropriateness and implication of 
incorporating a testlet definition into the estimation procedures of the conditional SEM for tests 
composed of testlets. This study also investigated the bias in the estimates of the conditional SEM that
arises from using item-based methods instead of testlet-based methods when the assumptions required by
measurement modeling have been violated.

The objectives of this study were: 

1. To investigate the relative appropriateness of each of several methods by making a comparison 
between the prespecified true conditional SEM and the estimates obtained from each method. 

2. To assess the relative magnitude of bias introduced by using each method in estimating the
conditional SEM for tests composed of testlets.

3. To examine the robustness of the item-based methods with respect to violation of the
conditional independence assumption in estimating the conditional SEM for tests composed of testlets.

4. To investigate the relationship between the degree of violation of the conditional independence 
assumption and the degree of bias in estimates of the conditional SEM. 

Methods of Estimating Conditional SEM 

In classical test theory, the standard error of measurement is estimated by
$\hat{\sigma}_E = S_X \sqrt{1 - \hat{\rho}_{XX'}}$,
where $S_X$ is the standard deviation of a set of test scores and $\hat{\rho}_{XX'}$ is the reliability estimate for those
test scores. This formula, which can be viewed as an average standard error of measurement, provides 
one estimate for all examinees, regardless of their score level (Qualls-Payne, 1992). However, it is 
reasonable to expect that the amount of error associated with individual scores could vary depending on 
where the true score is located on the score scale. 
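
As a small numeric illustration of the average SEM formula above, the following sketch computes $S_X\sqrt{1-\hat{\rho}_{XX'}}$ for a set of scores. The function name `classical_sem`, the example scores, and the reliability value are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which divisor is intended.

```python
import math
import statistics


def classical_sem(scores, reliability):
    """Average (unconditional) SEM: S_X * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    s_x = statistics.pstdev(scores)  # assumption: population SD of observed scores
    return s_x * math.sqrt(1.0 - reliability)


# Hypothetical raw scores and reliability estimate, for illustration only.
example_scores = [10, 12, 14, 16, 18]
example_sem = classical_sem(example_scores, reliability=0.84)
print(round(example_sem, 4))
```

The single value returned applies to every examinee, which is exactly the limitation that motivates conditional SEMs.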

Since the first edition of the Test Standards, the American Psychological Association, American
Educational Research Association and National Council on Measurement in Education (1954) have
recommended that test publishers estimate and report the standard error of measurement at several
points on the score scale. The current version, Standards for Educational and Psychological Testing
(American Educational Research Association, American Psychological Association & National Council on
Measurement in Education, 1985), also includes this recommendation in Standard 2.10 (p. 22).

A number of methods have been developed to estimate the conditional SEM. The earliest
investigators of the conditional SEM were probably Mollenkopf (1949) and Thorndike (1951). Lord
(1955, 1957) developed the best-known conditional SEM estimation formula using binomial error theory.
Feldt (1984) provided another estimation method using a compound binomial error model, which 
presumes that parallel forms involve stratified random samples of items. An item response theory (IRT) 
approach to estimating the conditional SEM was provided by Lord (1980), and recently a generalizability 
theory (G-theory) approach was presented by Brennan (1998). These methods can be thought of as the 
fundamental frameworks for estimating conditional SEMs, and several variations of these basic 
frameworks may be possible. Comprehensive reviews of most of these and related methods are
provided in Feldt & Brennan (1989) and Feldt & Qualls (1996).

However, the issues related to estimating the conditional SEM for tests composed of testlets have
not been addressed so far. (Brennan, 1998, investigated this issue under a generalizability theory
framework; however, he did not mention the testlet concept explicitly.) The estimation methods for the
conditional SEM were classified in this study as either item-based or testlet-based. The IRT and
G-theory approaches were considered for estimating the conditional SEM for each item-based and
testlet-based method. Because Lord’s binomial error model (1955, 1957) and Feldt’s compound binomial error
model (1984) are special cases of the G-theory approach for estimating the conditional SEM (Brennan,
1998), the IRT and G-theory approaches together include almost all basic formulas mentioned above,
except variations from Thorndike’s (1951) and Mollenkopf’s (1949) methods.

Two item-based estimation methods were considered: (a) a G-theory approach with a p×I design
[p×I method], where p represents persons, the object of measurement, and I represents the item facet,
and (b) a dichotomous IRT approach [DIRT method]. A G-theory approach with a p×(I:H) design [p×(I:H)
method], where p represents persons, H represents the passage facet, and I represents the item facet
within a passage, and polytomous IRT approaches for estimating conditional SEM using both Samejima’s
(1969) graded response model [GIRT method] and Bock’s (1972) nominal model [NIRT method] were
used as the testlet-based estimation methods.

Item-Based Methods

These methods, which assume that the appropriate measurement unit is an item, have been used 
most frequently for estimating the conditional SEM. Here, it is assumed that items are scored 
dichotomously, although the underlying methodology per se makes no such assumption (Brennan, 1998). 

p×I Method

The p×i G-theory design is appropriate for estimating the conditional SEM where i represents an
item facet composed of an infinite, undifferentiated set of items, and p represents an object of
measurement, a person in this case. Typically, it is assumed that the objects of measurement “facet” is
infinite. Let $X_{pi}$ denote an observed score for person p on item i. Then $X_{pi}$ can be represented as:

$$
\begin{aligned}
X_{pi} ={} & \mu && \text{(grand mean)} \\
 & + \mu_p - \mu && \text{(person effect)} \\
 & + \mu_i - \mu && \text{(item effect)} \\
 & + X_{pi} - \mu_p - \mu_i + \mu && \text{(residual effect).}
\end{aligned}
\qquad [1]
$$

In this linear model for a p×i G-study design, the decomposition in Equation [1] is for single person-item
combinations. Therefore, estimated variance components from a G-study are also for single items.
However, decisions are to be based on a total (or mean) score for a set of items. The linear model for such
a mean score is based on a p×I D-study design, and a linear model for a D-study design is the same as in
Equation [1], except for replacing i with I in all terms containing i. So, the variance components in a
D-study are for a set of items and not for a single item.

Two types of decisions can be differentiated in the G-theory framework: relative and absolute 
decisions. Corresponding to these two types of decisions, two types of errors can also be differentiated: 
relative and absolute errors (Cronbach, Gleser, Nanda & Rajaratnam, 1972; Shavelson & Webb, 1991; 
Brennan, 1992). In this study, only absolute errors are considered in comparing various estimation
methods because most other methods are based only on the absolute error definition. The absolute
conditional SEM for person p can be estimated by

$$
\hat{\sigma}(\Delta)_p = \sqrt{\frac{\sum_i (X_{pi} - X_{pI})^2}{I'(I - 1)}} \qquad [2]
$$

where $X_{pI}$ is person p's mean score over I items, I is the number of items in the G-study, and I' is the
number of items in the D-study (Brennan, 1998).
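
Equation [2] can be sketched in a few lines of code. The function name `conditional_sem_pxi` and the example response vector are hypothetical; the sketch assumes the result is expressed on the mean-score (proportion-correct) metric and that the D-study uses the same number of items as the G-study unless stated otherwise.

```python
import math


def conditional_sem_pxi(item_scores, n_items_d=None):
    """Absolute conditional SEM for one person under the p x I design (Equation [2]).

    item_scores : the person's scores on the I items of the G-study
    n_items_d   : I', the number of items in the D-study (defaults to I)
    """
    n_i = len(item_scores)
    if n_i < 2:
        raise ValueError("at least two items are needed")
    if n_items_d is None:
        n_items_d = n_i
    person_mean = sum(item_scores) / n_i                       # X_pI
    sum_sq = sum((x - person_mean) ** 2 for x in item_scores)  # sum over items of (X_pi - X_pI)^2
    return math.sqrt(sum_sq / (n_items_d * (n_i - 1)))


# Hypothetical examinee who answers three of four dichotomous items correctly.
print(conditional_sem_pxi([1, 1, 1, 0]))
```

For dichotomous items with I' = I, the sum of squares equals $I\bar{p}(1-\bar{p})$, so the result reduces to $\sqrt{\bar{p}(1-\bar{p})/(I-1)}$, which is Lord's binomial-error SEM on the proportion-correct metric.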

DIRT Method 

This method is based on an item response curve, representing the probability that an individual
person k with ability score $\theta_k$ will answer item i correctly, denoted $P_i(\theta_k)$. In this study, the three
parameter logistic model was used for obtaining the item response curve. To estimate the conditional
SEM using an IRT approach, it is necessary to obtain the distribution of the number-correct raw scores
given IRT ability ($\theta$) with estimated item parameters (Kolen, Zeng & Hanson, 1996). The probability of
random variable X representing a certain raw score on a K-item test for ability $\theta$ can be denoted as
$P(X = i \mid \theta)$, where i ranges from 0 to K. This notation expresses the conditional distribution of the
number-correct raw scores for a given ability level. Lord & Wingersky (1984) provided a recursion
formula to calculate these probabilities. Writing $X_r$ for the number-correct score on the first r items and
$P_r(\theta)$ for the probability of a correct answer to item r,

$$
P(X_r = i \mid \theta) =
\begin{cases}
P(X_{r-1} = i \mid \theta)\,[1 - P_r(\theta)] & i = 0 \\
P(X_{r-1} = i \mid \theta)\,[1 - P_r(\theta)] + P(X_{r-1} = i - 1 \mid \theta)\,P_r(\theta) & 0 < i < r \\
P(X_{r-1} = i - 1 \mid \theta)\,P_r(\theta) & i = r.
\end{cases}
\qquad [3]
$$

The variance of the resulting distribution is the conditional error variance of the number-correct raw
scores for ability $\theta$. Therefore, the conditional SEM for a given $\theta$ can be estimated by taking the square
root of this conditional error variance (Kolen, Zeng & Hanson, 1996).
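
The recursion and the resulting conditional SEM can be sketched as follows. This is a minimal illustration, not the software used in the study: the function names, the item parameters in the example, and the scaling constant D = 1.7 in the three-parameter logistic curve are assumptions.

```python
import math


def p_3pl(theta, a, b, c, scale=1.7):
    """Three-parameter logistic item response curve (scale = 1.7 is an assumption)."""
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


def lord_wingersky(correct_probs):
    """Number-correct score distribution given theta, built one item at a time."""
    dist = [1.0]  # before any item is added, a score of 0 has probability 1
    for p in correct_probs:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1.0 - p)   # item answered incorrectly
            new[score + 1] += mass * p       # item answered correctly
        dist = new
    return dist


def conditional_sem(dist):
    """Square root of the variance of a raw-score distribution."""
    mean = sum(x * m for x, m in enumerate(dist))
    variance = sum((x - mean) ** 2 * m for x, m in enumerate(dist))
    return math.sqrt(variance)


# Hypothetical (a, b, c) parameters for a four-item test, evaluated at theta = 0.
items = [(1.0, -0.5, 0.2), (0.8, 0.0, 0.2), (1.2, 0.5, 0.2), (0.9, 1.0, 0.2)]
probs = [p_3pl(0.0, a, b, c) for a, b, c in items]
print(conditional_sem(lord_wingersky(probs)))
```

Because the items are conditionally independent given $\theta$ under this model, the variance obtained from the recursion equals $\sum_r P_r(\theta)[1 - P_r(\theta)]$, which gives a convenient check on the implementation.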

Testlet-Based Methods

The testlet concept has been recommended as a useful tool for solving the problems arising from
situations in which the conditional independence assumption among items is violated (Thissen,
Steinberg & Mooney, 1989; Sireci, Thissen & Wainer, 1991; Wainer, Sireci & Thissen, 1991; Yen, 1993;
Wainer, 1995; Wainer & Thissen, 1996; Lee, Kolen, Frisbie & Ankenmann, 1998). The polytomous IRT
approaches incorporate this recommendation. The G-theory approach, however, can take the passage (or
testlet) facet into account as another source of variation (Lee & Frisbie, in press).

NIRT Method 

With respect to testlet applications, Bock’s nominal model has been used predominantly (Wainer 
& Thissen, 1996; Sireci, Thissen, & Wainer, 1991; Wainer, Sireci, & Thissen, 1991) because “the testlet 
scores are nominal (or at most semi-ordered) responses; as we show later, a score of 1 may not always 
reflect higher proficiency than a score of 0, due to guessing” (Thissen, Steinberg, & Mooney, 1989). This 
could be the reason that Bock’s nominal model has been used in this situation: polytomous models other 
than Bock’s nominal model assume ordered response categories. 

Under Bock’s (1972) nominal model, the probability that an examinee with a given ability ($\theta$)
responds to category k in passage j is

$$
P_{jk}(\theta) = \frac{\exp(a_{jk}\theta + c_{jk})}{\sum_{m=1}^{K} \exp(a_{jm}\theta + c_{jm})} \qquad [4]
$$

where j = 1, 2, ..., J (passages) and k = 1, 2, ..., K (categories). The constraints
$\sum_k a_{jk} = \sum_k c_{jk} = 0$ are imposed on this model. The parameters of this model are rescaled by using
centered polynomials of the associated scores to represent the category-to-category changes in the
$a_{jk}$ and $c_{jk}$ values:

$$
a_{jk} = \sum_{p=1}^{P} \alpha_{jp} \left(k - \frac{K}{2}\right)^{p}
\quad \text{and} \quad
c_{jk} = \sum_{p=1}^{P} \gamma_{jp} \left(k - \frac{K}{2}\right)^{p},
$$

where the parameters $\{\alpha_p, \gamma_p\}_j$, p = 1, 2, ..., P for $P \le K$, are the free
parameters to be estimated from the data (Thissen, Steinberg, & Mooney, 1989).

The next procedure for estimating the conditional SEM is similar to the application of the
dichotomous IRT models. For this procedure, it is necessary to obtain the distribution of number-correct
raw scores given IRT ability ($\theta$) under a polytomous model. Hanson (1994) extended the Lord &
Wingersky (1984) algorithm to polytomous items. The recursive algorithm is (Wang, Kolen & Harris,
1996):

For item 1,

$$
P_1(X = x \mid \theta) = P(U_1 = x \mid \theta), \qquad x = 0, 1, 2, \ldots, n_1. \qquad [5]
$$

For item k = 2, 3, 4, ..., K,

$$
P_k(X = x \mid \theta) = \sum_{u=0}^{n_k} P_{k-1}(X = x - u \mid \theta)\, P(U_k = u \mid \theta),
\qquad x = 0, 1, 2, \ldots, \sum_{m=1}^{k} n_m.
$$

In Equation [5], $U_k$ represents a random variable for the score on item k with scores from 0
to $n_k$. The appropriate probabilities can be obtained from Equation [4]. The variance of the resulting
distribution is the conditional error variance of the number-correct raw scores for ability $\theta$, and the
conditional SEM for a given $\theta$ can be estimated by taking the square root of this conditional error
variance.
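
The polytomous recursion in Equation [5] amounts to convolving the score distributions of the individual testlets. The sketch below illustrates this under the assumption that each testlet's category probabilities at a given ability are already available as a list indexed by testlet score 0, 1, ..., n_k; the function names and the example probabilities are hypothetical.

```python
import math


def polytomous_score_distribution(category_probs):
    """Total-score distribution given theta for polytomously scored units.

    category_probs[k][u] is P(U_k = u | theta) for u = 0, 1, ..., n_k.
    """
    dist = list(category_probs[0])            # first unit: P_1(X = x) = P(U_1 = x)
    for probs in category_probs[1:]:
        new = [0.0] * (len(dist) + len(probs) - 1)
        for x, mass in enumerate(dist):       # score accumulated so far
            for u, p in enumerate(probs):     # score on the unit being added
                new[x + u] += mass * p
        dist = new
    return dist


def conditional_sem(dist):
    """Square root of the variance of a raw-score distribution."""
    mean = sum(x * m for x, m in enumerate(dist))
    variance = sum((x - mean) ** 2 * m for x, m in enumerate(dist))
    return math.sqrt(variance)


# Two hypothetical testlets, each scored 0, 1, or 2 at some fixed theta.
testlets = [[0.2, 0.5, 0.3], [0.2, 0.5, 0.3]]
dist = polytomous_score_distribution(testlets)
print(dist, conditional_sem(dist))
```

With dichotomous units (two categories each) this reduces to the Lord & Wingersky recursion used in the DIRT method.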

GIRT Method 

In this study, Samejima’s (1969) graded response model was used, as well as Bock’s (1972) 
nominal model, in order to check on the possibility of using polytomous IRT models based on ordered 
categories. Samejima’s (1969) graded response model seems appropriate for estimating conditional 
SEMs. There would be an ordered quality to testlet-based scores if such scores corresponded to the 
extent of completeness of the examinee’s reasoning process within a specific testlet. This seems to be a 
reasonable representation for reading comprehension testlets, where several dichotomous items relate to 
a single reading passage. The more of such items within a testlet that an examinee answers correctly, 
the more extensive is his or her comprehension. Therefore, in the present study, Samejima’s (1969)
graded response model was compared to Bock’s (1972) nominal model with respect to performance in 
estimating the conditional SEM for tests composed of testlets (Lee, Kolen, Frisbie & Ankenmann, 1998). 

Under Samejima’s (1969) graded response model, consider passage j, in which the number-correct
score corresponding to the dichotomous items that constitute the passage can be classified into
one of K categories, numbered 1 through K inclusive with consecutive integers, and call such a response
a 'graded response'..." (p.20). Then, the probability that a graded response to passage j is classified into
category k or higher, given $\theta$, is

$$
P^{*}_{jk}(\theta) =
\begin{cases}
1 & k = 1 \\
\dfrac{1}{1 + \exp[-a_j(\theta - b_{j,k-1})]} & 2 \le k \le K \\
0 & k > K
\end{cases}
\qquad [6]
$$



The parameter $a_j$ is the passage discrimination parameter, which is constant across the
response categories of a particular passage (i.e., constant throughout the whole reasoning process). The
$b_{j,k-1}$ is the difficulty parameter of the category boundary k-1 ($2 \le k \le K$) for passage j, and it is free to
vary among the category boundaries of a particular passage such that $b_{j,k-1} < b_{j,k}$. (Note that $b_{j,k-1}$ is
the $\theta$-value at which the probability of the response being classified into category k or higher is 0.5.) The
probability that a graded response is classified in category k, given $\theta$, is defined by
$P_{jk}(\theta) = P^{*}_{jk}(\theta) - P^{*}_{j,k+1}(\theta)$, which is also written as

$$
P_{jk}(\theta) =
\begin{cases}
1 - \dfrac{1}{1 + \exp[-a_j(\theta - b_{j1})]} & k = 1 \\[2ex]
\dfrac{1}{1 + \exp[-a_j(\theta - b_{j,k-1})]} - \dfrac{1}{1 + \exp[-a_j(\theta - b_{jk})]} & 2 \le k \le K - 1 \\[2ex]
\dfrac{1}{1 + \exp[-a_j(\theta - b_{j,K-1})]} & k = K
\end{cases}
\qquad [7]
$$

The examinee’s number-correct score distribution can be obtained by using Equations [7] and [5]. Then,
the conditional SEM for a given $\theta$ can be estimated by following the same procedures that are used in
the NIRT method.
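
The category probabilities of the graded response model can be sketched by differencing the cumulative curves, as in Equations [6] and [7]. The function name `grm_category_probs` and the parameter values in the example are hypothetical.

```python
import math


def grm_category_probs(theta, a, boundaries):
    """Category probabilities P_j1, ..., P_jK under the graded response model.

    a          : passage discrimination a_j
    boundaries : increasing boundary difficulties [b_j1, ..., b_j,K-1]
    """
    if any(lo >= hi for lo, hi in zip(boundaries, boundaries[1:])):
        raise ValueError("boundary difficulties must be strictly increasing")
    # Cumulative probabilities P*_j1, ..., P*_j,K+1 (Equation [6]):
    # 1 for the lowest category, a logistic curve per boundary, 0 beyond K.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in boundaries]
    cumulative.append(0.0)
    # P_jk = P*_jk - P*_j,k+1 (Equation [7]).
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


# Hypothetical testlet with K = 4 categories, evaluated at theta = 0.
probs = grm_category_probs(0.0, a=1.0, boundaries=[-1.0, 0.0, 1.0])
print(probs)
```

When these probabilities are passed to the recursion in Equation [5], category k corresponds to a testlet number-correct score of k - 1, because the recursion indexes scores from 0.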

p×(I:H) Method

The univariate p×(i:h) design, persons (p) crossed with items (i) nested in passages (h), is
appropriate for estimating the conditional SEM for this situation. The linear model for the response of a
person to an item within a passage treats persons as objects of measurement and items and passages as
random facets. For this model, $n_p$ persons represent a random sample from a population of interest, and
$n_h$ passages represent a random sample from the universe of passages. The $n_{i:h}$ items in a passage are
also considered a random sample from the universe of items related to that passage. This linear model,
referred to as completely random, can be represented as:

$$
\begin{aligned}
X_{pih} ={} & \mu && \text{(grand mean)} \\
 & + \mu_p - \mu && \text{(person effect)} \\
 & + \mu_h - \mu && \text{(passage effect)} \\
 & + \mu_{i:h} - \mu_h && \text{(item within passage effect)} \\
 & + \mu_{ph} - \mu_p - \mu_h + \mu && \text{(person by passage interaction effect)} \\
 & + X_{pih} - \mu_{ph} - \mu_{i:h} + \mu_h && \text{(residual effect)}
\end{aligned}
\qquad [8]
$$

where p = 1, ..., $n_p$; i = 1, ..., $n_{i:h}$; and h = 1, ..., $n_h$.



A linear model for a D-study design is the same as in Equation [8], except for replacing i and h 
with I and H, respectively, in all terms containing i and h. Then, as Brennan (1996) has shown, the 
absolute conditional SEM can be computed by 



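$$
\hat{\sigma}(\Delta)_p = \sqrt{\frac{\hat{\sigma}^2(h)_p}{H} + \frac{\hat{\sigma}^2(i{:}h)_p}{I_+}},
\quad \text{where } I_+ = \sum_h I_h. \qquad [9]
$$

In Equation [9], H represents the number of passages (or testlets) in the D-study, and $I_h$ represents the number
of items within the hth passage in the D-study.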
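
Equation [9] needs within-person variance components for passages and for items within passages, and the text does not spell out how these are estimated. The sketch below assumes a balanced design (the same number of items in every passage) and the usual ANOVA estimators for a nested i:h design applied to one person's responses; both the estimators and the function name `conditional_sem_p_ih` are assumptions made for illustration, not the study's procedure.

```python
import math


def conditional_sem_p_ih(testlet_scores, n_passages_d=None, n_items_total_d=None):
    """Absolute conditional SEM for one person under the p x (I:H) design (Equation [9]).

    testlet_scores  : list of passages, each a list of that person's item scores
    n_passages_d    : H, passages in the D-study (defaults to the G-study value)
    n_items_total_d : I_+, total items in the D-study (defaults to the G-study value)
    """
    n_h = len(testlet_scores)
    n_i = len(testlet_scores[0])
    if n_h < 2 or n_i < 2 or any(len(t) != n_i for t in testlet_scores):
        raise ValueError("balanced data with >= 2 passages and >= 2 items each required")
    passage_means = [sum(t) / n_i for t in testlet_scores]
    grand_mean = sum(passage_means) / n_h
    # Assumed ANOVA mean squares for the nested i:h design within a person.
    ms_h = n_i * sum((m - grand_mean) ** 2 for m in passage_means) / (n_h - 1)
    ms_ih = sum((x - m) ** 2
                for t, m in zip(testlet_scores, passage_means)
                for x in t) / (n_h * (n_i - 1))
    var_ih = ms_ih                   # estimate of sigma^2(i:h)_p
    var_h = (ms_h - ms_ih) / n_i     # estimate of sigma^2(h)_p; may be negative
    big_h = n_h if n_passages_d is None else n_passages_d
    i_plus = n_h * n_i if n_items_total_d is None else n_items_total_d
    error_variance = var_h / big_h + var_ih / i_plus
    return math.sqrt(max(error_variance, 0.0))  # guard against a negative total


# Hypothetical examinee: all items right in one passage, all wrong in the other.
print(conditional_sem_p_ih([[1, 1], [0, 0]]))
```

Under these assumptions, and with D-study sample sizes equal to the G-study ones, the error variance simplifies to the variance of the passage means divided by H, that is $\sum_h (\bar{X}_{ph} - \bar{X}_p)^2 / [H(H-1)]$. For the example above this gives 0.5, whereas Equation [2] applied to the same four responses gives about 0.289, which illustrates how the two definitions can diverge when responses cluster within passages.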

Simulations

Model for simulations

Response data sets of tests composed of stimulus-based testlets (e.g., Reading Comprehension
tests) were simulated. Nandakumar (1991) provided a method of simulating a paragraph comprehension
test data set. According to her method, k items of a paragraph comprehension test are split into h groups
of items. Two abilities are considered to have influence on the examinee’s response to each item: one is
common to all items of the test (denoted as $\theta_g$ in this paper) and the other is unique to each group
(denoted as $\theta_h$, h = 1, 2, 3, ..., H, where H represents the number of passages in this paper). That is, the
examinee’s response to a certain item is influenced by general ability ($\theta_g$) and passage-specific ability
($\theta_h$). For example, if there are H passages in a test, H+1 (“1” represents a general ability influencing all
items in the test) abilities would be considered. These H+1 abilities are assumed to be independent,
standard normal random variables. She also introduced a bivariate extension of the unidimensional
three-parameter logistic model with compensatory abilities:

$$
P_i(\theta_g, \theta_h) = c_i + \frac{1 - c_i}{1 + \exp\{-1.7[a_{gi}(\theta_g - b_{gi}) + a_{hi}(\theta_h - b_{hi})]\}} \qquad [10]
$$

where $P_i(\theta_g, \theta_h)$ is the probability that an examinee having $\theta_g$ and $\theta_h$ ability scores answers
item i correctly,

$a_{gi}$ and $a_{hi}$ are the discrimination parameters of item i for general and passage-specific
ability dimensions, respectively,

$b_{gi}$ and $b_{hi}$ are the difficulty parameters of item i for general and passage-specific ability
dimensions, respectively, and

$c_i$ is the lower asymptote parameter of item i.

For simulating the data set for this study, the parameters shown in Equation [10] need to be
selected. The item difficulty parameters $b_{gi}$ and $b_{hi}$ were taken from independent, identical normal
distributions. The item discrimination parameters $a_{gi}$ and $a_{hi}$ were generated using the following
equations from Nandakumar (1991):

$$
\begin{aligned}
a_{gi} &\sim N\{(1 - \xi)\mu,\ \sqrt{1 - \xi}\,\sigma\} \\
a_{hi} &\sim N\{\xi\mu,\ \sqrt{\xi}\,\sigma\} \\
a_{gi} + a_{hi} &\sim N\{\mu,\ \sigma\}
\end{aligned}
\qquad [11]
$$

where $\mu$ and $\sigma$ represent the mean and the standard deviation, respectively, of the discrimination
parameter for a test. The $\xi$ can be interpreted as the degree of influence of each passage-specific ability
relative to the general ability on an item. For example, if $\xi$ is equal to zero, then the examinee’s
response depends upon only the level of general ability. As the value of $\xi$ increases, the influence of
the passage-specific ability increases. Consequently, the conditional dependence among items within
passages would increase. In this way, it is possible to manipulate the level of conditional dependence
among items within passages by specifying different values of $\xi$.

Procedures for simulating data sets 

The model for simulations discussed so far is based on a two-dimensional IRT approach with 
compensatory abilities. In this model, passage-specific abilities were considered as one factor influencing 
an examinee’s response, with general ability being another factor. The conceptualization of this model 
treats passages as a fixed facet, not a random one. However, this conceptualization is different from the 
one adopted in this paper. Previously in this paper, the passage facet was considered a random facet. In 
order to incorporate this different conceptualization about the test into the Nandakumar (1991)
procedures, the data were simulated as follows: 

Step 1. Specify a test composed of testlets.

1-1. Fix the total number of items, k (e.g., k = 42).

1-2. Split k items into h groups of items (e.g., $h_1$ = 6, $h_2$ = 6, $h_3$ = 6, $h_4$ = 6, $h_5$ = 6, $h_6$ = 6, $h_7$ = 6).

Step 2. Generate a population of persons based on general ability. Select n examinees randomly
from the general ability scale, $\theta_g$, assuming $\theta_g$ is distributed as standard normal (e.g.,
n = 1000).

Step 3. Specify a test form. 

Generate the item parameters from the distributions defined in Equation [11]. 

Step 4. Generate passage specific abilities for each examinee of the generated population for a 
specified test form. 

For each selected examinee, generate passage-specific abilities on the scale $\theta_h$, h = 1, 2, 3,
..., H, assuming each $\theta_h$ is independently distributed as standard normal.

Step 5. Generate a response data set. 

5-1. Compute the probability of a correct answer to item i for each examinee using
Equation [10]. Then, create a matrix A, which is composed of n rows (representing
examinees) by k columns (representing items) using computed probabilities.

5-2. Generate random numbers from a uniform distribution U(0, 1) and create a matrix B
with dimensions n x k.

5-3. From a comparison of elements between matrices A and B, generate matrix C,
composed of 0 or 1. Assign 1 to cij if bij is equal to or less than aij; otherwise,
assign 0 to cij.

From these steps, an examinee’s response data set consisting of 0 and 1 can be obtained. 
Repeating the procedures from step 3 to step 5 would make another examinee’s response data set. In 
these procedures, the general ability of each examinee was fixed (not included in the repeated loop), and 
the passage specific abilities of each examinee were selected from the specified distributions (included in 
the repeated loop). These procedures can be thought of as a modification of the procedures used by 
Nandakumar (1991), permitting the passages to be considered a random facet, not fixed. That is, the 
examinee’s passage-specific abilities were assumed to change across randomly sampled passages. These 
data simulation procedures were required for obtaining the true conditional SEM, which was used as a 
criterion for comparing various estimation methods for tests composed of testlets. 
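
Steps 1 through 5 can be sketched as below. This is an illustrative reading of the procedure, not the code used in the study. The function name `simulate_form`, the random seed, and the mean and standard deviation of the difficulty distribution (`b_mean`, `b_sd`) are assumptions; the values of $\mu$, $\sigma$, c, $\xi$, the test length, and the sample size are examples mentioned in the text.

```python
import math
import random


def simulate_form(theta_g, rng, n_passages=7, items_per_passage=6,
                  xi=0.3, mu=1.630, sigma=0.553, c=0.2, b_mean=0.0, b_sd=1.0):
    """Simulate one test form (Steps 3 to 5) for examinees with fixed general abilities."""
    # Step 3: draw item parameters; discriminations follow Equation [11].
    items = []
    for h in range(n_passages):
        for _ in range(items_per_passage):
            a_g = rng.gauss((1.0 - xi) * mu, math.sqrt(1.0 - xi) * sigma)
            a_h = rng.gauss(xi * mu, math.sqrt(xi) * sigma)
            b_g = rng.gauss(b_mean, b_sd)   # assumed difficulty distribution
            b_h = rng.gauss(b_mean, b_sd)
            items.append((h, a_g, a_h, b_g, b_h))
    responses = []
    for t_g in theta_g:
        # Step 4: passage-specific abilities are redrawn for every form.
        theta_h = [rng.gauss(0.0, 1.0) for _ in range(n_passages)]
        row = []
        for h, a_g, a_h, b_g, b_h in items:
            # Step 5-1: probability of a correct answer from Equation [10].
            z = 1.7 * (a_g * (t_g - b_g) + a_h * (theta_h[h] - b_h))
            p = c + (1.0 - c) / (1.0 + math.exp(-z))
            # Steps 5-2 and 5-3: score 1 when the uniform draw does not exceed p.
            row.append(1 if rng.random() <= p else 0)
        responses.append(row)
    return responses


rng = random.Random(2024)                               # arbitrary seed
theta_g = [rng.gauss(0.0, 1.0) for _ in range(1000)]    # Step 2: drawn once
form_1 = simulate_form(theta_g, rng)                    # first simulated form
form_2 = simulate_form(theta_g, rng)                    # same examinees, new form
```

Keeping `theta_g` outside the function mirrors the description above: general ability stays fixed across replications, while item parameters and passage-specific abilities are resampled each time a form is generated.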

Preliminary Analyses for Simulations

The purposes of doing the preliminary analyses for sets of simulations were: (1) to make the
simulated data sets as similar as possible to the real data sets and (2) to determine the appropriate $\xi$
values. The need for doing the second preliminary analysis relates to the two-dimensional compensatory
model that was used in this study.

The decision between the compensatory and non-compensatory models is a somewhat subjective
one. It seems reasonable to apply the compensatory model to the testlet situations rather than applying
the non-compensatory model within the appropriate range of $\xi$ values. For example, suppose that a
certain student takes a reading comprehension test and the first passage deals with the topic of baseball
games. Also, suppose the value of $\xi$ is in a reasonable range (e.g., about 0.3). If that student's reading
comprehension ability level (general ability in this study) is in the middle score range but passage-specific
ability for the first passage (baseball) is in the high score range, then the probability that the
student correctly answers items associated with the first passage would be expected to be slightly higher
than the probability based on reading comprehension ability (general ability) alone. That
is, the student’s high passage-specific ability could compensate for his or her lower general ability in
answering a given item correctly, yielding a score somewhat above the middle range.

However, the compensatory model has some limitations. For example, assume the same test
situation and a relatively high $\xi$ value (e.g., about 0.7). Even though the student's reading
comprehension ability is extremely low (e.g., $\theta_g$ = -3.0), if the passage-specific ability is extremely high
(e.g., $\theta_h$ = 3.0), a very high probability of correctly answering items belonging to that passage would be
expected. This case seems somewhat unreasonable and unrealistic. Therefore, even though the
compensatory model could be more reasonable than the non-compensatory model, it should be used
under a reasonable range of $\xi$ values. Checking this range was the second reason for doing the
preliminary analyses.

In explaining simulation procedures earlier, the procedures for selecting $\mu$ and $\sigma$ for Equation
[11] were not described thoroughly, even though it was mentioned that these values are the mean and
standard deviation of the discrimination parameter. Nandakumar (1991) selected these values from real
data sources such as the SAT verbal test battery, the ACT mathematics test battery, the Armed Services
Vocational Aptitude Battery for auto shop information, and so forth. For this study, the means,
standard deviations, and maximum and minimum values of the discrimination, difficulty, and lower
asymptote parameter estimates based on the three-parameter logistic model for several Iowa Tests of Basic
Skills (ITBS) tests composed of testlets are reported in Table 1.



Insert Table 1 About Here 



The means and standard deviations of item parameter estimates for the grade 7 Reading 
Comprehension test were initially selected as inputs to Equations [10] and [11] for the first step of 
preliminary analyses. (The c parameter in Equation [10] was fixed to 0.2 for all simulated items.) Then, 
the simulation procedures were applied under each $\xi$ value specified in Table 2, and the degree of
dependence measure and general characteristics of simulated data sets are presented in the same table. 



Insert Table 2 About Here 



The $\xi$ values ranged from 0.1 to 0.6, with an interval of 0.1. For Step 1 in Table 2, the means of
Yen’s (1984) Q3 statistics for between- and within-passage item pairs are most similar to those of the
target (between-passage: -0.022, within-passage: 0.027) for the $\xi$ value of 0.5. When the values of $\xi$ are
less than 0.5, the means of between-passage Q3 statistics are similar to each other, but the means of
within-passage Q3 statistics are different from those of the target with the different $\xi$ values. The
positive relationship between the mean of within-passage Q3 statistics and the $\xi$ value can be found in
this table. This result seems to be reasonable, because this positive relationship between conditional
dependence and the $\xi$ value could be explained by the logic embedded in the simulation model used in
this study.

However, one important finding can be observed by comparing the general characteristics
of the target and the six simulated data sets. That is, even though the means of the Q3 statistics
for between- and within-passage item pairs for the $\xi$ value of 0.5 are similar to those of the target, the
mean discrimination parameter of the simulated data set is much smaller than that of the target.
Furthermore, there is a tendency for the mean discrimination parameter estimates to shrink more,
compared to that of the target, as the value of $\xi$ increases. In contrast, the mean of the difficulty
parameter has a much greater value compared to that of the target. The mean of the lower asymptote
parameter estimates is slightly higher than that of the target. In sum, the general characteristics of the
item parameter estimates under the $\xi$ value of 0.5 are very different from those of the target.

Another important check would be to compare the mean and standard deviation of the target and 
simulated data sets. Using the mean and standard deviation of proportion-correct scores would be more 
sensible than using raw scores because the grade 7 Reading Comprehension test and the simulated data 
sets have different total numbers of items. From this comparison, non-negligible differences can also be 
observed. In short, the item parameter estimates and general characteristics of the simulated data set 
under the $\xi$ value of 0.5 are too different from those of the target, even though the conditional
dependence measures from both data sources are similar. 

Based on these results, inputs to Equations [10] and [11] for simulations were changed by using
a linear estimation. For example, in Step 1, the $\mu$ and $\sigma$ in Equation [11] were assumed as 0.952 and
0.287, which were derived from Table 1, but in Step 2, they were modified to be 1.630 and 0.553,
respectively, by setting linear equations of (0.952:0.556 = ?:0.952) and (0.287:0.149 = ?:0.287). The other
parameter specifications for simulations were computed by using the same linear estimation method.
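
The proportional adjustment described above can be checked with a short calculation. The sketch assumes the reading that each Step 1 input is scaled by the ratio of that input to the second number in the stated proportion (0.556 and 0.149 are taken directly from the text); the function name `rescale_input` is hypothetical.

```python
def rescale_input(step1_input, comparison_value):
    """Solve step1_input : comparison_value = x : step1_input for x."""
    return step1_input * step1_input / comparison_value


new_mu = rescale_input(0.952, 0.556)      # mean of the discrimination parameter
new_sigma = rescale_input(0.287, 0.149)   # SD of the discrimination parameter
print(round(new_mu, 3), round(new_sigma, 3))
```

Rounded to three decimals, these ratios agree with the Step 2 values of 1.630 and 0.553 quoted in the text.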

The simulation procedures were applied with the new set of parameter specifications. The general 
characteristics of the simulated data sets were computed and are presented in Table 2 under the heading 
of Step 2.

In contrast to the results from Step 1, the means of the Q3 statistics for between- and within-passage
item pairs under the $\xi$ value of 0.3 are most similar to those of the target. Descriptive statistics
about the item parameter estimates and general characteristics of the simulated data set are much more
similar to those of the target than are those from Step 1. Therefore, in this study, parameter
specifications under Step 2 were used for the subsequent simulation studies.

The next issue is selecting appropriate specific ξ values. Based on the results
presented in Table 2, it might be reasonable to set ξ values around 0.3 for the simulations. The values
from 0.2 to 0.4 at intervals of 0.025 (ξ values of 0.200, 0.225, 0.250, ..., 0.350, 0.375, and 0.400) were
examined by investigating the conditional dependence measures and the preliminary results from applying
the various conditional SEM estimation methods. As a result, 0.275, 0.300, 0.325, and 0.350 were selected as
the ξ values for simulating response data. The relationship between the specified ξ values and the
conditional dependence measures is presented in Table 3.



Insert Table 3 About Here 

In order to investigate the relationship between the specified ξ values and the degree of conditional
dependence, the measures of conditional dependence were applied to each
simulated data set under each specified value of ξ. As anticipated, the degree of conditional dependence
among within-passage items increases as the ξ value goes up. The interpretation of the results for
these conditional dependence measures is the same as was given earlier for examining the conditional
independence assumption for the real data sets. In general, a ξ value of 0.275 can be understood to
represent a somewhat mild violation of the assumptions compared to the real data sets used in this study.
The ξ values of 0.300 and 0.325 provide conditional dependence measures similar to those obtained
from the real data sets. These two ξ values represent a moderate violation of the assumptions. The ξ
value of 0.350 provides conditional dependence measures indicating a severe violation of the
assumptions compared to the real data sets.

Criterion Indexes for Simulations 

It would be informative and convenient to formulate overall indexes to represent the degree of 
error involved in using each estimation method. First, the error can be conceptualized as the difference 
between an estimate of conditional SEM for each examinee from using a particular estimation method 
and the true conditional SEM for that examinee: 

$$\widehat{sem}_{pr} - sem_{p}$$   [12]

where $sem_{p}$ is the true conditional SEM for person p and $\widehat{sem}_{pr}$ is the estimated conditional SEM for the
same person p on a particular replication r. This error can be divided into two parts: bias induced by a
particular estimation method and the random error over replications. In this study, 50 replications were
conducted and then these two components were disentangled as:

$$[\widehat{sem}_{pr} - \overline{sem}_{p}] + [\overline{sem}_{p} - sem_{p}]$$   [13]

where $\overline{sem}_{p}$ is the average of the estimated conditional SEMs for person p over replications. (In this study, the
fitted mean, obtained from a polynomial regression, was used for this average value of the conditional
SEMs.) The first part represents random error, and the second part represents bias associated with
using a particular estimation method.

Based on the above conceptualization, three indexes were developed: average root-mean-squared
error (ARMSE), average root-mean-squared bias (ARMSB), and average standard error of
estimate (ASEE):

$$ARMSE = \sqrt{\frac{1}{PR}\sum_{p=1}^{P}\sum_{r=1}^{R}\left(\widehat{sem}_{pr} - sem_{p}\right)^{2}}$$

$$ARMSB = \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(\overline{sem}_{p} - sem_{p}\right)^{2}}$$   [14]

$$ASEE = \sqrt{\frac{1}{PR}\sum_{p=1}^{P}\sum_{r=1}^{R}\left(\widehat{sem}_{pr} - \overline{sem}_{p}\right)^{2}}$$

where P represents the total number of simulees and R represents the total number of replications. One
advantage of using these indexes is that the variance of total error can be decomposed into two parts: one
for squared bias and the other for random error variance. That is, the equation $ARMSE^{2} = ARMSB^{2} + ASEE^{2}$
always holds.
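
The three indexes can be illustrated with a minimal sketch. It assumes that the estimates are held in a list of lists indexed by person and then replication, and that the per-person average is the plain mean over replications; the study instead used a fitted mean from a polynomial regression, so this is a simplification. The function name error_indexes and the toy numbers are hypothetical.

```python
import math


def error_indexes(estimates, true_sem):
    """Return (ARMSE, ARMSB, ASEE).

    estimates[p][r] -- estimated conditional SEM for person p on replication r
    true_sem[p]     -- true conditional SEM for person p
    """
    n_persons = len(estimates)
    n_reps = len(estimates[0])
    sq_error = sq_bias = sq_random = 0.0
    for est_p, true_p in zip(estimates, true_sem):
        mean_p = sum(est_p) / n_reps          # assumption: plain mean, not a fitted mean
        sq_bias += (mean_p - true_p) ** 2     # bias part of Equation [13]
        for est in est_p:
            sq_error += (est - true_p) ** 2   # total error, Equation [12]
            sq_random += (est - mean_p) ** 2  # random part of Equation [13]
    armse = math.sqrt(sq_error / (n_persons * n_reps))
    armsb = math.sqrt(sq_bias / n_persons)
    asee = math.sqrt(sq_random / (n_persons * n_reps))
    return armse, armsb, asee


# Toy data: two simulees, two replications each.
armse, armsb, asee = error_indexes([[2.0, 3.0], [4.0, 6.0]], [2.0, 5.0])
print(abs(armse ** 2 - (armsb ** 2 + asee ** 2)) < 1e-12)  # decomposition check
```

With the plain replication mean, the cross-product term vanishes for every person, which is why the squared indexes add up exactly in this sketch.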



Analysis Strategies 

The true conditional SEM was obtained so that it could be compared with the estimates from the
various estimation methods applied to the simulated data set. In the previous section, the data
simulation procedures were outlined in five steps. To obtain the true conditional SEM for each
selected examinee, the procedures from Step 3 to Step 5 were repeated a specified number of times, r.

For each simulated data set, the total score of each examinee was computed, and then the standard
deviation of these r total scores was computed for each examinee. This standard deviation
can be thought of as the examinee's true conditional SEM as r goes to infinity. In this
study, the data generation procedures were replicated 1000 times, and 1000 (assumed) true conditional
SEMs, one for each of the 1000 examinees selected in Step 2, were computed. These true conditional SEMs served as
criteria for the estimates obtained using the various item-based and testlet-based estimation methods.
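
The replication idea can be sketched as follows. This is only an illustration: it assumes locally independent dichotomous items with fixed success probabilities for one examinee, which stands in for Steps 3 to 5 of the study's testlet-based generating model. The function name simulate_true_sem and the use of the population standard deviation are assumptions.

```python
import random
import statistics


def simulate_true_sem(item_probs, n_reps=1000, seed=0):
    """Approximate one examinee's true score and true conditional SEM.

    item_probs -- probability of a correct response on each item for this examinee
    n_reps     -- number of replicated administrations (1000 in the study)
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_reps):
        # One replicated administration: sum of simulated 0/1 item scores.
        totals.append(sum(1 for p in item_probs if rng.random() < p))
    true_score = statistics.mean(totals)   # mean total score over replications
    true_sem = statistics.pstdev(totals)   # S.D. of total scores over replications
    return true_score, true_sem


true_score, true_sem = simulate_true_sem([0.5] * 40)
print(true_score, true_sem)
```

For this toy examinee the standard deviation should settle near the binomial value sqrt(40 x 0.25), about 3.16, as the number of replications grows.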

One more data set was generated using the same simulation procedures to obtain an examinee’s 
response data set. Using this data set, the item-based and testlet-based conditional SEM estimation 
methods were applied. For the G-theory approach, a computer application program (Brennan, 1996) was 
used to estimate the conditional SEM for each pxI or px(I:H) design. For IRT methods, the BILOG
(Mislevy & Bock, 1990) and MULTILOG (Thissen, 1991) computer programs were used for estimating 
item parameters and ability parameters. The number-correct raw score distribution for given theta 
values was formulated, and the conditional SEM was computed by a FORTRAN90 application program 
written for this purpose. The estimate from each method was then compared with the true conditional 
SEM of each examinee. These comparison procedures were repeated 50 times to control the error of 
estimates that may influence the magnitude of the estimated conditional SEM. From these results, the 
most appropriate method for estimating the conditional SEM for tests composed of testlets was 
determined, and also the most robust method among item-based methods was identified. 
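
The text does not describe the algorithm inside the FORTRAN90 program. The sketch below therefore assumes the standard recursive approach to the number-correct score distribution (often attributed to Lord and Wingersky), a three-parameter logistic response function with scaling constant D = 1.7, and locally independent dichotomous items. All function names are hypothetical.

```python
import math


def p_correct_3pl(theta, a, b, c, scale=1.7):
    """Three-parameter logistic probability of a correct response (scale is assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


def score_distribution(probs):
    """Number-correct score distribution for independent items, built item by item."""
    dist = [1.0]  # before any item, a score of 0 has probability 1
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1.0 - p)   # item answered incorrectly
            new[score + 1] += mass * p       # item answered correctly
        dist = new
    return dist


def conditional_sem(theta, items):
    """S.D. of the number-correct score at a given theta; items are (a, b, c) tuples."""
    probs = [p_correct_3pl(theta, a, b, c) for a, b, c in items]
    dist = score_distribution(probs)
    mean = sum(score * mass for score, mass in enumerate(dist))
    var = sum((score - mean) ** 2 * mass for score, mass in enumerate(dist))
    return math.sqrt(var)


# Four identical items with a = 1, b = 0, c = 0, evaluated at theta = 0 (p = 0.5 each).
print(conditional_sem(0.0, [(1.0, 0.0, 0.0)] * 4))
```

For polytomous testlet scores the same recursion would extend to more than two score categories per unit, but that extension is not shown here.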

In order to investigate the relationship between conditional dependence and bias in estimates of 
the conditional SEM using item-based methods, the above procedures for comparing various estimation 
methods were repeated under certain prespecified values of ξ (0.275, 0.300, 0.325, and 0.350). To make
the interpretation of ξ more meaningful, the relationship between different ξ values and the level of
conditional dependence was investigated. The generalizability of the results from analyzing the
simulated data sets was checked with the real data sets.

Results

Results from Simulations (ξ = 0.275)

Figure 1 shows comparisons between the true conditional SEM and the mean of estimated 
conditional SEMs over 50 replications of using each estimation method. The horizontal axis in each 
graph of the figure represents a true score scale, which was computed by averaging the total test scores 
of examinees over 1000 replications, following the steps outlined in the previous section. In order to get the
true conditional SEMs, the standard deviation of the total scores of each examinee over 1000 replications
was computed, and a curve was fitted to the SEMs of the 1000 examinees to obtain the true conditional
SEM. The mean of the estimated conditional SEMs was obtained by averaging the conditional SEM
estimates over 50 replications for each estimation method. That is, each method was applied to each 
replication and repeated 50 times. 

Insert Figure 1 About Here 

The pxI method provides estimates of conditional SEM that are similar to the true conditional
SEM, even though it slightly overestimates the conditional SEM in the middle score range. The 
conditional SEM estimates of the px(I:H) method are similar to the true conditional SEM, but it also 
slightly overestimates conditional SEM in the middle score range. This method also has much larger 
fluctuations within true scores than do the other estimation methods. The DIRT method provides 
smaller estimates of the conditional SEM compared to the true conditional SEM. That is, the DIRT 
method underestimates the conditional SEM of test scores based on testlets. The GIRT and NIRT 
methods provide estimates of the conditional SEM that are similar to each other. In the middle score 
range, the estimates from these two polytomous IRT estimation methods are similar to the true 
conditional SEM, but in the lower and higher score ranges, they overestimate the conditional SEM. 

To show more general trends, the fitted lines of the conditional SEM estimates from each method are
plotted in Figure 2 along with a line for the true conditional SEM. The fitted line of the conditional SEM
estimates of each method was obtained by applying a polynomial regression technique. In the middle
score range, all estimation methods except the DIRT method provide similar estimates of conditional
SEM. But in the lower and higher score ranges, the GIRT and NIRT methods give higher estimates. This
overestimation is slightly greater for the GIRT method than for the NIRT method in the lower
score range. The pxI and px(I:H) methods provide almost the same estimates of conditional SEM along
the true score scale.

Insert Figure 2 About Here 

Bias lines, based on the fitted line from each estimation method, and the true conditional SEM as 
a baseline are presented in Figure 3. The bias trends are similar for the pxI and px(I:H) methods. That
is, both methods provide slightly positive bias in the middle score range. The DIRT estimation method 
gives negatively biased estimates throughout the score scale. Even though the bias lines for both 
polytomous IRT models seem to be more dramatic than the one from the DIRT method, the influence of 
bias in a practical sense would be much greater with the DIRT method compared to the polytomous IRT 
estimation methods. That is, because the distribution of true scores is similar to the normal distribution, 
the bias in the middle score range would be more severe and influential than the bias in the extremes 
due to the larger number of examinees affected. 

Insert Figure 3 About Here 

Discussion so far has focused on the bias introduced by each estimation method in terms of a 
fitted line and did not consider the error of estimates. Figure 4 shows the standard error of estimate
for each estimation method. Much larger standard errors of estimate can be found for the px(I:H)
method compared to the other estimation methods. That is, the px(I:H) method provides fitted
conditional SEM estimates that are similar to true conditional SEM, but these estimates contain 
relatively large amounts of error. 

Insert Figure 4 About Here 

Three indexes of error associated with each estimation method under the four specified ξ values are
presented in Table 4. Under the ξ value of 0.275, the pxI method provides the smallest ARMSE.
Therefore, if there is a need to estimate the conditional SEM for each person on one administration of a
test, the pxI method would produce relatively small amounts of error. (The GIRT and NIRT methods
would be similar.) However, the ARMSB of the px(I:H) method is the smallest, which means this
method introduces the least amount of bias in estimating the conditional SEM for each person. Even though
this method has the smallest value of ARMSB, it has the largest value of ASEE. The proportion of
the total error variance explained by the error of estimate is about 99.3% for the px(I:H) estimation
method.
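
The percentage quoted above follows from the squared indexes. As a hedged arithmetic check, the sketch below assumes that each share is the squared index divided by the sum of the two squared indexes; the function name error_shares is hypothetical, and the inputs are the rounded px(I:H) values reported in Table 4 for ξ = 0.275.

```python
def error_shares(armsb: float, asee: float):
    """Percentage of total error variance due to bias and to error of estimate."""
    total = armsb ** 2 + asee ** 2
    return 100.0 * armsb ** 2 / total, 100.0 * asee ** 2 / total


# Rounded Table 4 entries for the px(I:H) method at ksi = 0.275.
bias_pct, see_pct = error_shares(0.071, 0.829)
print(f"{bias_pct:.1f}% {see_pct:.1f}%")  # prints "0.7% 99.3%"
```

Because the tabled indexes are rounded to three decimals, shares recomputed this way can differ from the tabled percentages by a few tenths of a point for other methods.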

Insert Table 4 About Here 

In comparing the DIRT and polytomous IRT methods, the polytomous IRT methods provide smaller 
ARMSE and ARMSB values, and they provide ASEE values similar to the DIRT method. The NIRT 
estimation method seems to be only slightly better than the GIRT estimation method in the context of 
mild violation of assumptions for measurement modeling. 

Results from Simulations (ξ = 0.300)

Comparisons of the true conditional SEM and mean of the estimated conditional SEMs over 50 
replications of using each estimation method are presented in Figure 5. The pxI method underestimates
the conditional SEM in the middle of the score range. The underestimation of the DIRT method here is
much more evident compared to the results from the ξ value of 0.275 in Figure 1. Both the GIRT and
NIRT estimation methods provide slightly underestimated conditional SEMs in the middle score range 
and overestimated conditional SEMs in the lower and higher score ranges. The px(I:H) method provides 
estimates of conditional SEM similar to the true conditional SEM, but it has much greater error of 
estimate compared to the other estimation methods. 

Insert Figure 5 About Here 

Insert Figure 6 About Here 
The fitted line of the true conditional SEM and fitted lines for the estimated conditional SEM using each
method are provided in Figure 6. The pxI, NIRT, and GIRT methods provide similar estimates of the
conditional SEM in the middle score range. The px(I:H) method provides the highest conditional SEM 
estimates in the middle score range, while in the lower and higher score ranges, the GIRT and NIRT 
estimation methods do. 

Insert Figure 7 About Here 

Figure 7 shows the bias lines from each estimation method on the true score scale. The px(I:H) 
method provides conditional SEM estimates that are quite similar to the true conditional SEM, even 
though it overestimates a little in both the lower and higher score ranges. The pxI method
underestimates the conditional SEM in the middle score range (from about 13 to 32). The DIRT method
underestimates the conditional SEM along almost all of the score range. The NIRT method underestimates
the conditional SEM a little more in the middle score range than the GIRT method does. Much larger standard
errors of estimate can be identified for the px(I:H) method by comparing the standard error of estimate
plots among estimation methods presented in Figure 8. These results are very similar to those shown in
Figure 4 for ξ = 0.275.

Insert Figure 8 About Here 
According to Table 4, the pxI method provides the smallest ARMSE, but the px(I:H) method
provides the smallest ARMSB. Both the GIRT and NIRT estimation methods provide much smaller 
ARMSE and ARMSB compared to the DIRT method. The GIRT method provides a little bit smaller 
ARMSE and ARMSB compared to the NIRT method, but both methods have similar ASEE values. 

Results from Simulations (ξ = 0.325)

Figure 9 shows comparisons between the true conditional SEM and the mean of estimated 
conditional SEMs using each estimation method under the ξ value of 0.325. Basically, the trends
observed in this figure are similar to those found in Figure 5 for the ξ value of 0.300, except for two
differences. First, the pxI method provides much smaller estimates of conditional SEM compared to the
true conditional SEM along the true score scale. Second, the discrepancy between the true conditional
SEM and the estimate of conditional SEM from the DIRT method becomes greater when the ξ value
moves from 0.300 to 0.325. The fitted conditional SEM for each method and the true conditional SEM are
presented in Figure 10. The bias lines of the estimation methods and the standard error of estimate plots
are provided in Figure 11 and Figure 12, respectively. Compared to the bias lines for the ξ value of 0.300,
the G-theory approaches produce different trends. The IRT approaches yield trends of bias lines that are
similar to those for ξ = 0.300.

Insert Figure 9 About Here 
Insert Figure 10 About Here 
Insert Figure 11 About Here 
Insert Figure 12 About Here 

According to Table 4, the px(I:H) method provides a much smaller ARMSB value compared to the 
other estimation methods, but it still has the largest ASEE value. Both polytomous IRT methods provide 
much smaller ARMSE values compared to the other estimation methods. 

Results from Simulations (ξ = 0.350)

The results from simulations under the ξ value of 0.350 are presented in Figures 13, 14, 15, and
16. The trends and interpretations are similar to those for the
simulations under the ξ value of 0.325. The main difference is that the degree of bias increased as the ξ
value changed from 0.325 to 0.350. According to Table 4, the px(I:H) method provides much smaller
ARMSB than do the other methods. The GIRT method provides the smallest ARMSE value, even though 
the NIRT method has the smaller ARMSB value. 

Insert Figure 13 About Here 
Insert Figure 14 About Here 
Insert Figure 15 About Here
Insert Figure 16 About Here 

Relationship between Degree of Violation of Assumptions 
and Bias in Estimates of the Conditional SEM 
One of the research objectives of this study was to investigate the relationship between the 
degree of violation of the assumptions required by measurement modeling and the amount of bias in the 
estimates of the conditional SEM using item-based methods instead of testlet-based methods. To address 
this objective, bias lines for each of the four specified values of ξ are replotted in the same graph, all
shown in Figure 17, for the purpose of comparison. As discussed in explaining Table 3, the ξ values
have a positive relationship with the degree of conditional dependence. 

Insert Figure 17 About Here 

According to Figure 17, in the top-left graph for the pxI method, bias increases as the ξ value goes up
(ignoring the ξ value of 0.275). This finding can be confirmed by the overall indexes in Table 4. The
ARMSB of the pxI method changes in accordance with the ξ values: the ARMSB values are 0.108, 0.217,
and 0.349 for ξ values of 0.300, 0.325, and 0.350, respectively. The reason for excluding the results from the ξ
value of 0.275 is that the pxI method has a tendency to overestimate the conditional SEM for
unidimensional tests (Agresti & Coull, 1998; Lee, Brennan & Kolen, 1998); it overestimates the
conditional SEM under the ξ value of 0.275. Therefore, the results from the ξ value of
0.275 would not be appropriate for investigating bias trends here. By comparing the bias of the DIRT
method for the specified ξ values, it is evident that there is a positive relationship between the degree of
bias and the degree of violation of assumptions. The ARMSB values are 0.254, 0.394, 0.463, and
0.589 for the ξ values of 0.275, 0.300, 0.325, and 0.350, respectively.

Reducing the Standard Error of Estimate in the px(I:H) Method 
The px(I:H) estimation method provides the smallest ARMSB and the highest ASEE for all
conditions of the simulations in this study. It also provides the highest ARMSE value compared to the other
estimation methods. Consequently, this method would not be a good choice for estimating the conditional
SEM of each examinee on one test administration, even though it introduces the least bias. However, if it 
is possible to reduce the error of estimate of the px(I:H) method, then this method would have an 
important advantage over the other methods in estimating conditional SEMs for tests composed of 
testlets in practical situations. Two techniques could be considered in the practical use of this method. 
One is to use the fitted estimates of conditional SEMs, and the other is to report conditional SEMs at 
only integer score points. 

Brennan (1998) indicated that considerable errors were involved in estimates from the px(I:H)
method and suggested that the fitted estimates be used rather than the unfitted ones. He also argued
that “this seems especially appropriate when the number of observations within objects of measurement
is small and the number of objects of measurement is large” (p. 33). This situation seems to apply to
each replication of the simulations used in this study. The fitted estimates of conditional SEM using a
quadratic function were computed for each replication for the ξ value of 0.325, and the ARMSE,
ARMSB, and ASEE were calculated.
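
A quadratic fit of this kind can be sketched with ordinary least squares. The text does not say which predictor or software was used for the fit, so the sketch assumes that the unfitted estimates are regressed on a score variable and solves the normal equations directly; the function names fit_quadratic and fitted_values are hypothetical.

```python
def fit_quadratic(x, y):
    """Least-squares coefficients (c0, c1, c2) of y = c0 + c1*x + c2*x**2."""
    # Normal equations built from power sums of x.
    s = [sum(xi ** k for xi in x) for k in range(5)]
    rows = [[s[i + j] for j in range(3)] for i in range(3)]
    rhs = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(3)]
    m = [row + [b] for row, b in zip(rows, rhs)]  # augmented 3x4 matrix
    n = 3
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("need at least three distinct x values")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coef = [0.0] * n
    for i in reversed(range(n)):
        acc = m[i][n] - sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = acc / m[i][i]
    return tuple(coef)


def fitted_values(x, coef):
    """Evaluate the fitted quadratic at each x; these replace the unfitted estimates."""
    c0, c1, c2 = coef
    return [c0 + c1 * xi + c2 * xi * xi for xi in x]


scores = [0, 1, 2, 3, 4]
unfitted = [1.0, 2.5, 3.0, 2.5, 1.0]   # toy values lying on 1 + 2x - 0.5x^2
print(fit_quadratic(scores, unfitted))
```

Replacing each examinee's unfitted estimate with the value on the fitted curve pools information across examinees, which is why the error of estimate would be expected to drop at the cost of some added bias.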

Figure 18 shows the comparison between the true conditional SEM and the mean of fitted
estimates of conditional SEM (fitted px(I:H) method). The fitted px(I:H) method provides estimates of
conditional SEM similar to the true conditional SEMs. Comparing this figure with the top-right graph
in Figure 9 shows much less variation among the points. The bias line for the fitted px(I:H) method is
presented in Figure 19. Compared with the top-right graph in Figure 11, slightly larger
bias can be found, which is mainly due to overestimation relative to the true conditional SEM. The
standard errors of estimate of the fitted px(I:H) method are plotted in Figure 20. Much smaller standard
errors of estimate were obtained by using the fitted estimates of conditional SEMs instead of the unfitted
ones, which can be confirmed by comparing this figure with the top-right graph in Figure 12. According
to Table 4, the fitted px(I:H) method produces much smaller ARMSE and ASEE values, but a larger ARMSB
value, than the px(I:H) method. Even though the magnitude of ASEE for the px(I:H)
method decreases when the fitted estimates of conditional SEMs are used rather than the unfitted ones, it is
still the highest value among the methods. Because, from a practical standpoint, it would be
sensible to use the fitted estimates of conditional SEMs instead of the unfitted ones, this technique
seems to be a promising one for reducing the standard error of estimate of the px(I:H) estimation
method.

Insert Figure 18 About Here 
Insert Figure 19 About Here 
Insert Figure 20 About Here 

Aggregating the conditional SEM on integer score points is another technique for reducing error 
of estimate. According to the Standards for Educational and Psychological Testing (American
Educational Research Association, American Psychological Association & National Council on 
Measurement in Education, 1985), conditional SEMs should be reported at appropriate, well-separated 
levels or intervals. In this study, the conditional SEM for each integer score point was recalculated by 
grouping examinees based on their true scores. For example, in order to get an aggregated estimate of 
the conditional SEM for the true score of 18, the average of the conditional SEM estimates over 
examinees having true scores between 17.5 and 18.5 was computed. This idea was applied to obtaining 
both true conditional SEMs and the estimates of the px(I:H) method on integer score points, which are 
reported in Figure 21. In this figure, the data sets from the ξ value of 0.325 were used. The px(I:H)
method provides similar estimates of the conditional SEM on integer score points compared to the true 
conditional SEMs. 
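
The aggregation step can be sketched as a simple grouping. The sketch assumes that a true score of exactly x.5 is assigned to the next higher integer, which the text does not specify; the function name aggregate_by_integer_score and the toy values are hypothetical.

```python
import math
from collections import defaultdict


def aggregate_by_integer_score(true_scores, sem_estimates):
    """Average conditional SEM estimates within each integer true-score point.

    An examinee with true score t contributes to integer point k
    when k - 0.5 <= t < k + 0.5 (boundary handling is an assumption).
    """
    groups = defaultdict(list)
    for score, sem in zip(true_scores, sem_estimates):
        point = math.floor(score + 0.5)
        groups[point].append(sem)
    return {point: sum(vals) / len(vals) for point, vals in sorted(groups.items())}


# Two examinees fall in the interval for 18, one in the interval for 19.
print(aggregate_by_integer_score([17.6, 18.4, 18.6], [2.0, 3.0, 5.0]))
# prints {18: 2.5, 19: 5.0}
```

Averaging over all examinees in an interval is what reduces the error of estimate, since each reported value then rests on many individual estimates rather than one.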

Insert Figure 21 About Here 

Figure 22 shows the bias of the px(I:H) estimation method under these new estimations. This 
representation of bias is very similar to the one in Figure 11. The standard error of estimate of the 
px(I:H) estimation method for each integer score point is given in Figure 23. Much smaller errors of 
estimate are observed compared to those using the conditional SEM estimates of individual examinees, 
which are presented in the top-right plot in Figure 12. Comparing both plots, the errors of estimate 
decrease from about 0.9 to 0.2. Because the conditional SEMs are provided for integer score points in
practice by many testing companies, this technique is a promising method for estimating conditional 
SEM for tests composed of testlets. 

Insert Figure 22 About Here 



Insert Figure 23 About Here 



Discussion 

Based on findings of this study, these conclusions are offered: 

First, in general, the item-based estimation methods, both the pxI and DIRT methods,
underestimate the conditional SEM for tests composed of testlets. However, the pxI method provides
good estimates of the conditional SEM under mild violation of the assumptions, and this method is more
robust to the violation of the assumptions compared to the DIRT method. The robustness of the pxI
estimation method might be due to its tendency to overestimate the conditional SEM for a 
unidimensional test. 

Second, the px(I:H) method introduces the smallest amount of bias, but the largest error of 
estimate. This method seems to be the best estimation method for tests composed of testlets in terms of 
the magnitude of bias. One way to reduce the error of estimate dramatically is to use a quadratic fit, as 
discussed by Brennan (1998). Also, reporting conditional SEMs at well-separated score intervals seems 
to be an efficient way of reducing the error of estimate. 

Third, the GIRT and NIRT methods provide similar estimates of the conditional SEM. Therefore, 
the use of Samejima’s graded response model seems to be as appropriate as Bock’s nominal model, at
least with respect to performance in estimating the conditional SEM for tests composed of testlets. Both
methods provide estimates of the conditional SEM that are similar to the true conditional SEM in the 
middle score range, but they overestimate the conditional SEM in the lower and higher score ranges. 
This overestimation might be caused by loss of information when testlet scores are used as the unit of 
analysis, as indicated by Yen (1993). These methods provide good estimates of the conditional SEMs
under moderate and somewhat severe violation of assumptions. 

Fourth, the bias of the item-based estimation methods increases as the degree of conditional 
dependence goes up. That is, an increase in the extent of violation of the assumptions required by 
measurement modeling leads to a corresponding increase in bias in the estimates of the conditional SEM 
for tests composed of testlets. 

Table 1

Descriptive Statistics of Item Parameter Estimates for Several ITBS Tests Composed of Testlets

Statistic       Reading     Reading     Maps        Maps
                Grade 4     Grade 7     Grade 4     Grade 7

Mean a_i         0.805       0.952       0.961       0.807
S.D. a_i         0.251       0.287       0.330       0.225
Max a_i          1.449       1.748       1.673       1.343
Min a_i          0.384       0.427       0.499       0.436

Mean b_i         0.355       0.342       0.212       0.782
S.D. b_i         0.851       0.960       0.824       0.670
Max b_i          2.059       2.405       1.635       1.952
Min b_i         -1.039      -1.776      -1.534      -0.309

Mean c_i         0.163       0.202       0.175       0.194
S.D. c_i         0.036       0.052       0.055       0.045
Max c_i          0.248       0.337       0.282       0.320
Min c_i          0.090       0.127       0.094       0.141

Note. Reading = Reading Comprehension, Maps = Maps and Diagrams, Vocab = Vocabulary,
Sim = simulated.

Table 2

Characteristics of Simulated Data Sets for Specified ξ Values

Criterion                 Target   ξ=0.1   ξ=0.2   ξ=0.3   ξ=0.4   ξ=0.5   ξ=0.6

Step 1
Mean of Q3   Between       -.022   -.016   -.016   -.019   -.016   -.022   -.020
             Within         .027   -.018   -.006    .002    .013    .033    .034
S.D. of Q3   Between        .044    .033    .034    .033    .034    .033    .035
             Within         .061    .031    .034    .033    .036    .038    .046
Mean                        25.4    20.9    20.9    20.9    21.5    20.7    21.2
S.D.                        9.08    7.78    7.83    6.73    5.93    5.51    4.83
Mean of Prop                .552    .498    .498    .498    .512    .493    .505
S.D. of Prop                .197    .185    .186    .160    .141    .131    .115
Mean of a_i's               .952   1.068   1.129    .715    .596    .556    .521
S.D. of a_i's               .287    .335    .323    .173    .186    .149    .172
Mean of b_i's               .342    .784   1.013    .771   1.018    .973   1.399
S.D. of b_i's               .960    .778    .720   1.135   1.600   1.316   1.836
Mean of c_i's               .202    .256    .289    .219    .250    .236    .280
S.D. of c_i's               .052    .061    .085    .041    .053    .030    .059

Step 2
Mean of Q3   Between       -.022   -.017   -.019   -.023   -.028   -.032   -.035
             Within         .027   -.015   -.004    .024    .057    .093    .118
S.D. of Q3   Between        .044    .035    .036    .035    .036    .033    .035
             Within         .061    .047    .038    .048    .044    .064    .063
Mean                        25.4    22.5    22.4    22.3    21.9   22.55    21.7
S.D.                        9.08    9.71    9.40    8.13    8.12    7.19    6.31
Mean of Prop                .552    .536    .533    .531    .521    .536    .517
S.D. of Prop                .197    .231    .224    .194    .193    .171    .150
Mean of a_i's               .952   1.291   1.111    .891    .834    .702    .626
S.D. of a_i's               .287    .355    .343    .360    .243    .209    .177
Mean of b_i's               .342    .297    .224    .486    .370    .385    .664
S.D. of b_i's               .960    .915    .958   1.003    .925   1.206   1.205
Mean of c_i's               .202    .175    .173    .199    .178    .210    .221
S.D. of c_i's               .052    .034    .034    .065    .033    .046    .044

Note. Target = grade 7 Reading Comprehension test, Mean of Prop = mean of proportion-correct
scores, S.D. of Prop = standard deviation of proportion-correct scores.

Table 3

Descriptive Statistics for Q3 Statistics for Four Specified ξ Values

Q3 Statistics         ξ=0.275   ξ=0.300   ξ=0.325   ξ=0.350

Mean    Between        -0.021    -0.022    -0.025    -0.026
        Within          0.016     0.022     0.029     0.042
S.D.    Between         0.038     0.042     0.035     0.035
        Within          0.053     0.055     0.051     0.049

Table 4

Average Root-Mean-Squared Error (ARMSE), Average Root-Mean-Squared Bias (ARMSB), and
Average Standard Error of Estimate (ASEE) for Each Estimation Method for Four Values of ξ

Method             ARMSE     ARMSB             ASEE

ξ=0.275
pxI                 .219     .096 (19.4%)      .197 (80.6%)
px(I:H)             .832     .071 (0.7%)       .829 (99.3%)
DIRT                .289     .254 (77.0%)      .139 (23.0%)
GIRT                .227     .183 (65.1%)      .134 (34.9%)
NIRT                .222     .175 (62.5%)      .136 (37.5%)

ξ=0.300
pxI                 .237     .108 (20.7%)      .211 (79.3%)
px(I:H)             .844     .083 (1.0%)       .840 (99.0%)
DIRT                .423     .394 (86.8%)      .153 (13.2%)
GIRT                .264     .223 (71.4%)      .141 (28.6%)
NIRT                .275     .232 (71.3%)      .147 (28.7%)

ξ=0.325
pxI                 .320     .217 (46.0%)      .235 (54.0%)
px(I:H)             .886     .040 (0.2%)       .885 (99.8%)
fitted px(I:H)      .344     .106 (9.6%)       .327 (90.4%)
DIRT                .496     .463 (87.1%)      .179 (12.9%)
GIRT                .240     .173 (52.0%)      .167 (48.0%)
NIRT                .239     .159 (44.2%)      .179 (55.8%)

ξ=0.350
pxI                 .425     .349 (67.4%)      .243 (32.6%)
px(I:H)             .917     .065 (0.5%)       .915 (99.5%)
DIRT                .618     .589 (90.8%)      .188 (9.2%)
GIRT                .267     .195 (53.2%)      .184 (46.8%)
NIRT                .271     .182 (45.0%)      .201 (55.0%)

Note. pxI = G-theory estimation method with a pxI design, px(I:H) = G-theory estimation method with a
px(I:H) design, DIRT = dichotomous IRT estimation method, GIRT = graded response model estimation
method, NIRT = nominal model estimation method. The number in parentheses is the
percentage of the total error variance explained by bias or by error of estimate.
[Plot axis labels for Figures 1-4: True Score, Conditional SEM, Standard Error of Estimate]

Figure 1. Comparisons of true conditional standard error of
measurement and the mean of estimated conditional standard
errors of measurement over 50 replications using five estimation
methods and ksi=0.275.

Figure 2. True conditional standard error of measurement and
fitted conditional standard error of measurement for five
estimation methods and ksi=0.275.

Figure 3. The bias line for each estimation method relative to
the true conditional standard error of measurement for ksi=0.275.

Figure 4. Standard error of estimate of each estimation method
over 50 replications for ksi=0.275.
[Plot axis labels for Figures 5-8: True Score, Conditional SEM, Fitted Bias]

Figure 5. Comparisons of true conditional standard error of
measurement and the mean of estimated conditional standard
errors of measurement over 50 replications using five estimation
methods and ksi=0.300.

Figure 6. True conditional standard error of measurement and
fitted conditional standard error of measurement for five
estimation methods and ksi=0.300.

Figure 7. The bias line for each estimation method relative to
the true conditional standard error of measurement for
ksi=0.300.

Figure 8. Standard error of estimate of each estimation method
over 50 replications for ksi=0.300.
[Plot axis labels for Figures 9-12: True Score, Conditional SEM, Fitted Conditional SEM, Fitted Bias]

Figure 9. Comparisons of true conditional standard error of
measurement and the mean of estimated conditional standard
errors of measurement over 50 replications using five estimation
methods and ksi=0.325.

Figure 10. True conditional standard error of measurement and
fitted conditional standard error of measurement for five
estimation methods and ksi=0.325.

Figure 11. The bias line for each estimation method relative to
the true conditional standard error of measurement for ksi=0.325.

Figure 12. Standard error of estimate of each estimation method
over 50 replications for ksi=0.325.
[Plot axis labels for Figures 13-16: Conditional SEM, Fitted Bias]

Figure 13. Comparisons of true conditional standard error of
measurement and the mean of estimated conditional standard
errors of measurement over 50 replications using five estimation
methods and ksi=0.350.

Figure 14. True conditional standard error of measurement and
fitted conditional standard error of measurement for five
estimation methods and ksi=0.350.

Figure 15. The bias line of each estimation method relative to
the true conditional standard error of measurement for
ksi=0.350.

Figure 16. Standard error of estimate of each estimation method
over 50 replications for ksi=0.350.
Figure 17. Comparison of bias lines for four specified ksi
values within each of five estimation methods.

Figure 18. Comparison between the true conditional standard
error of measurement and the mean of estimated conditional
standard errors of measurement for the fitted px(I:H) method.

Figure 19. The bias line for the fitted px(I:H) estimation
method compared to the true conditional standard error of
measurement.

[Plot axis label for Figure 20: Standard Error of Estimate]

Figure 20. Standard error of estimate of the fitted
px(I:H) method.

Figure 21. Comparison between the true conditional standard
error of measurement and the mean of estimated conditional
standard errors of measurement for the px(I:H) method using
only integer score points.

Figure 22. Bias of the px(I:H) estimation method
compared to the true conditional standard error of
measurement using only integer score points.

Figure 23. Standard error of estimate of the px(I:H)
estimation method using only integer score points.